A Comparison of Approaches for Automated Text Extraction from Scholarly Figures
نویسندگان
چکیده
So far, there has not been a comparative evaluation of different approaches for text extraction from scholarly figures. In order to fill this gap, we have defined a generic pipeline for text extraction that abstracts from the existing approaches as documented in the literature. In this paper, we use this generic pipeline to systematically evaluate and compare 32 configurations for text extraction over four datasets of scholarly figures of different origin and characteristics. In total, our experiments have been run over more than 400 manually labeled figures. The experimental results show that the approach BS-4OS results in the best F-measure of 0.67 for the Text Location Detection and the best average Levenshtein Distance of 4.71 between the recognized text and the gold standard on all four datasets using the Ocropy OCR engine.
منابع مشابه
Automated Data Extraction from Scholarly Line Graphs
Line graphs are ubiquitous in scholarly papers. They are usually generated from a data table and often used to compare performances of various methods. The data in these figures can not be accessed. Manual extraction of this data is hard and not scalable. On the other hand, automated systems for such data extraction task is not yet available. We report an analysis of line graphs to explain the ...
متن کاملAn Architecture for Information Extraction from Figures in Digital Libraries
Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data which is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. An extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. A Search...
متن کاملDeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes De...
متن کاملروش جدید متنکاوی برای استخراج اطلاعات زمینه کاربر بهمنظور بهبود رتبهبندی نتایج موتور جستجو
Today, the importance of text processing and its usages is well known among researchers and students. The amount of textual, documental materials increase day by day. So we need useful ways to save them and retrieve information from these materials. For example, search engines such as Google, Yahoo, Bing and etc. need to read so many web documents and retrieve the most similar ones to the user ...
متن کاملارائه روشی برای استخراج کلمات کلیدی و وزندهی کلمات برای بهبود طبقهبندی متون فارسی
Due to ever-increasing information expansion and existing huge amount of unstructured documents, usage of keywords plays a very important role in information retrieval. Because of a manually-extraction of keywords faces various challenges, their automated extraction seems inevitable. In this research, it has been tried to use a thesaurus, (a structured word-net) to automatically extract them. A...
متن کامل